Lab 08c: Recommender systems

Introduction

In this lab, you will build a simple movie recommender using $k$ nearest neighbours regression. At the end of the lab, you should be able to:

  • Replace missing values in a data set.
  • Create a $k$ nearest neighbours regression model.
  • Use the model to predict new values.
  • Measure the accuracy of the model.

Getting started

Let's start by importing the packages we'll need. This week, we're going to use the neighbors subpackage from scikit-learn to build $k$ nearest neighbours models.


In [ ]:
%matplotlib inline
import pandas as pd

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

Next, let's load the data. Write the path to your ml-100k.csv file in the cell below:


In [ ]:
path = 'data/ml-100k.csv'

Execute the cell below to load the CSV data into a pandas data frame indexed by the user_id field in the CSV file.


In [ ]:
df = pd.read_csv(path, index_col='user_id')
df.head()

Exploratory data analysis

Let's start by computing some summary statistics about the data:


In [ ]:
stats = df.describe()
stats

As can be seen, the data consists of film ratings in the range [1, 5] for 1664 films. Some films have been rated by many users, but the vast majority have been rated by only a few (i.e. there are many missing values):


In [ ]:
ax = stats.loc['count'].hist(bins=30)
ax.set(
    xlabel='Number of ratings',
    ylabel='Frequency'
);
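
We can quantify this sparsity directly by counting the missing entries (a quick check; it doesn't modify the data):


In [ ]:
# Fraction of missing entries in the ratings matrix
print('%.1f%% of the ratings are missing' % (100 * df.isna().mean().mean()))

# Number of users and films in the data set
print(df.shape)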

We'll need to replace the missing values with appropriate substitutes before we can build our model. One approach is to fill each instance where a user didn't rate a film with the average rating of that film (although there are alternatives, e.g. the median or mode rating). We can compute the average rating of each film via the mean method of the data frame:


In [ ]:
average_ratings = df.mean()

average_ratings.head()

Next, let's substitute these averages wherever a value is missing. With pandas, you can do this with the fillna method, as follows:


In [ ]:
df = df.fillna(value=average_ratings)
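
If you want to experiment with one of the alternative substitutions mentioned above, the median works in exactly the same way. A minimal sketch (since df has already been filled in place, it reloads the raw data first):


In [ ]:
# Alternative: fill missing ratings with each film's median rating
# (df has already been filled with means above, so reload the raw data)
df_median = pd.read_csv(path, index_col='user_id')
df_median = df_median.fillna(value=df_median.median())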

Data modelling

Let's build a movie recommender using user-based collaborative filtering. For this, we'll need to build a model that can identify the most similar users to a given user and use that relationship to predict ratings for new movies. We can use $k$ nearest neighbours regression for this.
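
Before applying it to the ratings data, it helps to see what $k$ nearest neighbours regression computes on a toy example. Here's a minimal sketch with made-up values: the prediction for a new point is the (possibly weighted) average of the targets of its $k$ nearest training points.


In [ ]:
import numpy as np

# Toy data (made-up values, just to illustrate the idea)
toy_X = np.array([[1.0], [2.0], [3.0], [10.0]])
toy_y = np.array([1.0, 2.0, 3.0, 9.0])

toy_model = KNeighborsRegressor(n_neighbors=2)
toy_model.fit(toy_X, toy_y)

# The two nearest neighbours of 2.4 are 2.0 and 3.0, so the
# (uniformly weighted) prediction is the mean of their targets: 2.5
print(toy_model.predict([[2.4]]))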

Before we build the model, let's specify ratings for some of the films in the data set. This gives us a target variable to fit our model to. The values below are just examples - feel free to add your own ratings or change the films.


In [ ]:
y = pd.Series({
    'L.A. Confidential (1997)': 3.5,
    'Jaws (1975)': 3.5,
    'Evil Dead II (1987)': 4.5,
    'Fargo (1996)': 5.0,
    'Naked Gun 33 1/3: The Final Insult (1994)': 2.5,
    'Wings of Desire (1987)': 5.0,
    'North by Northwest (1959)': 5.0,
    "Monty Python's Life of Brian (1979)": 4.5,
    'Raiders of the Lost Ark (1981)': 4.0,
    'Annie Hall (1977)': 5.0,
    'True Lies (1994)': 3.0,
    'GoldenEye (1995)': 2.0,
    'Good, The Bad and The Ugly, The (1966)': 4.0,
    'Empire Strikes Back, The (1980)': 4.0,
    'Godfather, The (1972)': 4.5,
    'Waterworld (1995)': 1.0,
    'Blade Runner (1982)': 4.0,
    'Seven (Se7en) (1995)': 3.5,
    'Alien (1979)': 4.0,
    'Free Willy (1993)': 1.0
})

Next, let's select the features to learn from. In user-based collaborative filtering, we need to identify the users that are most similar to us. Consequently, we need to transpose our data matrix (with the T attribute of the data frame) so that its columns (i.e. features) represent users and its rows (i.e. samples) represent films. We'll also need to select just the films that we specified above, as our target variable consists of these only.


In [ ]:
X = df.T.loc[y.index]

X.head()
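
It's worth sanity-checking the result: X should have one row for each film we rated and one column for each user in the data set.


In [ ]:
# One row per rated film, one column per user
print(X.shape)
print(X.shape[0] == len(y))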

Let's build a $k$ nearest neighbours regression model, using a grid search with nested cross validation: an inner loop selects the best combination of hyperparameters, while an outer loop estimates the prediction error of the selected model on unseen data:


In [ ]:
algorithm = KNeighborsRegressor()

parameters = {
    'n_neighbors': [2, 5, 10, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'euclidean']
}

# Use inner CV to select the best model
inner_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10

model = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs
model.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(model, X, y, cv=outer_cv)

# Print the results 
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the nearest neighbours regression model',
    xlabel='Error'
);

As can be seen, the $k$ nearest neighbours model is able to predict ratings to within ±0.88 on average, with a standard deviation of 0.97. While this error is not negligible, it's small enough for the predictions to be useful. Further improvements could be made by filling the missing values in a different way or by providing more ratings.
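
We can also inspect which hyperparameter combination the grid search selected, which is useful when deciding whether to widen the parameter grid:


In [ ]:
# The hyperparameters chosen by the inner cross validation
model.best_params_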

Making predictions

Now that we have a final model, we can predict our ratings for the films we haven't rated and recommend those with the highest predicted ratings:


In [ ]:
predictions = pd.Series(dtype=float)  # Specify a dtype to avoid creating an empty object-dtype series
for film in df.columns:
    if film in y.index:
        continue  # If we've already rated the film, skip it
    predictions[film] = model.predict(df.loc[:, [film]].T)[0]

predictions.sort_values(ascending=False).head(10)
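
Similarly, sorting the predictions in ascending order shows the films the model expects us to rate lowest, i.e. ones to avoid:


In [ ]:
predictions.sort_values().head(10)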